Introduction



Airport delays are one of the most common problems people face when travelling. These delays are often attributed to certain carriers or destinations. The question, then, is whether those are really the cause or whether other factors are responsible.

To answer this question, we use data on 336,776 flights that departed from New York City in 2013. The flights left from all three of its airports, John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR) and LaGuardia Airport (LGA), for destinations all over the United States and some of its territories (Puerto Rico and the U.S. Virgin Islands). The data is described by the 19 attributes listed in Table 1, followed by a sample of the data in Table 2.


Goal

The aim of this analysis is to determine whether there is a relationship between New York City’s flight delays and the attributes of the dataset. We hypothesize that flight distance and destination are major contributors to the delays, so that delays should differ between destinations at different distances.


Methods

  • Data Visualization
  • Exploratory Data Analysis (EDA)
  • Data Munging


The Dataset

Attribute Description
year Year of departure.
month Month of departure.
day Day of departure.
dep_time Actual departure time (format HHMM or HMM), local time zone.
arr_time Actual arrival time (format HHMM or HMM), local time zone.
sched_dep_time Scheduled departure time (format HHMM or HMM), local time zone.
sched_arr_time Scheduled arrival time (format HHMM or HMM), local time zone.
dep_delay Departure delay, in minutes. Negative times represent early departures.
arr_delay Arrival delay, in minutes. Negative times represent early arrivals.
carrier Two letter carrier abbreviation.
flight Flight number.
tailnum Plane tail number.
origin Flight origin.
dest Flight destination.
air_time Amount of time spent in the air, in minutes.
distance Distance between airports, in miles.
hour Hour of scheduled departure.
minute Minutes of scheduled departure.
time_hour Scheduled date and hour of the flight as a POSIXct date.

Table 1: A list of all attributes present in the dataset and their descriptions


Table 2: A sample of the dataset and how it is structured



Now let’s take a look at our data:

##           year          month            day       dep_time sched_dep_time 
##              0              0              0           8255              0 
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier 
##           8255           8713              0           9430              0 
##         flight        tailnum         origin           dest       air_time 
##              0           2512              0              0           9430 
##       distance           hour         minute      time_hour 
##              0              0              0              0
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 
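The three listings above (missing values per column, column types, and summary statistics) could be produced along the following lines. This is a minimal sketch on a toy data frame; `flights_toy` and its values are hypothetical stand-ins for the real table, and `str()` is used here as a base R counterpart of `dplyr::glimpse()`, whose output format the second listing resembles.

```r
# Toy stand-in for the flights table (hypothetical values, real column names).
flights_toy <- data.frame(
  dep_time  = c(517L, 533L, NA, 544L),
  dep_delay = c(2, 4, NA, -1),
  carrier   = c("UA", "UA", "AA", "B6"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(flights_toy))
print(na_counts)

# Column types and first values.
str(flights_toy)

# Per-column summary statistics (quartiles, mean, NA count).
print(summary(flights_toy))
```

On the real data, the missing-value counts are the quickest way to see that departure and arrival fields are empty for a few thousand flights.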



EDA

Although 3 airports for one city may seem like a lot, more than many of the world’s capitals have, the city that never sleeps hosts only 3 of the 16 airports in the state of New York, which itself is not the state with the largest number of airports, as shown in Figure 1.


Figure 1: An interactive map showing the number of airports per state. Puerto Rico (7 airports) and U.S. Virgin Islands (2 airports) are not shown.

Based on the figure we can observe that:


The three airports of New York City serve different purposes, such as domestic or international traffic, and hence should have different numbers of flights. Let us confirm that using Figure 2.

Figure 2: Number of flights per NYC airport

From the figure above, we can see that all of the airports have relatively similar numbers of flights, with LaGuardia Airport (LGA) having the lowest, probably due to the fact that it is a domestic airport. Newark Liberty International Airport (EWR) has the largest share of flights, which may be explained by its location: it lies in New Jersey, close to the border with New York state, which makes it a strategic location and possibly more favourable than JFK International Airport.


Despite the large number of flights and airports around the United States, Figure 3 shows that some states were never reached from NYC airports during 2013.


Figure 3: Map showing flights per state as destination

The map shows that the most visited destination is Florida, followed by California with almost half as many flights. The map also shows 8 states with zero flights towards them, namely Mississippi, Kansas, Idaho, New Hampshire, New Jersey, Delaware, South Dakota and North Dakota.


Figure 4 shows the delays per airport:

Figure 4: Number of delays per NYC airport

The figure features positive and negative points: positive values indicate a departure or arrival behind schedule, negative values one ahead of schedule. The average delay across all airports is +9.7 minutes. Although each airport shows a different average, this is not enough to predict the delay of a flight from its airport alone.
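As an illustration of how per-airport averages like these can be obtained, here is a minimal sketch on toy data. The data frame `toy` and its values are hypothetical, and using `dep_delay` alone is an assumption; the report does not state which delay column feeds the +9.7 minute figure.

```r
# Toy data (hypothetical values); NA marks a flight with no recorded departure.
toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  dep_delay = c(10, NA, 5, -3, 0, 8)
)

# Mean departure delay per origin airport.
# The formula interface drops rows with NA by default (na.action = na.omit).
delay_by_origin <- aggregate(dep_delay ~ origin, data = toy, FUN = mean)
print(delay_by_origin)

# Overall mean across all airports, ignoring missing values.
overall_mean <- mean(toy$dep_delay, na.rm = TRUE)
print(overall_mean)
```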

To further investigate that, let us take a look at the attributes and how they affect each other.

Figure 5: Correlation matrix of the dataset attributes

Here we see the correlation matrix between the main numeric attributes.
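A correlation matrix of this kind can be computed with base R’s `cor()`. The sketch below uses toy values and an assumed selection of three numeric columns. Because the delay columns contain missing values, `use = "pairwise.complete.obs"` is chosen here so that each pair of columns is compared on the rows where both are present; this is one possible choice, not necessarily the one used for Figure 5.

```r
# Toy numeric columns (hypothetical values, real column names).
toy <- data.frame(
  dep_delay = c(2, 4, NA, -1, 30, 12),
  arr_delay = c(11, 20, 33, -18, 25, NA),
  distance  = c(1400, 1416, 1089, 1576, 762, 719)
)

# Pearson correlations, computed per pair on the rows where both values exist.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```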


Although the correlation matrix shows the relationships between many variables, it does not cover a very important aspect of flights: time.

Figure 6: Departure and Arrival delays per hour


The first plot, on the left, indicates that the lowest number of departure delays occurs in the early hours of the day, at 5 am, and late at night, at 10-11 pm. The worst time is between 4 and 7 pm.


The second plot, on the right, indicates a similar result: the lowest number of arrival delays occurs in the early hours of the day, at 5 am, and late at night, at 10-11 pm, and the worst time is again between 4 and 7 pm.
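Counts of delays per hour, as plotted in Figure 6, could be derived roughly as follows. This sketch assumes that a "delay" means a flight with `dep_delay > 0`; the report does not state the threshold, and the toy values are hypothetical.

```r
# Toy data: scheduled departure hour and departure delay in minutes.
toy <- data.frame(
  hour      = c(5, 5, 17, 17, 17, 22),
  dep_delay = c(-2, 0, 30, 12, -1, 3)
)

# Keep only flights that left late (assumed definition of a delay).
delayed <- toy[!is.na(toy$dep_delay) & toy$dep_delay > 0, ]

# Number of delayed flights per scheduled departure hour.
delays_per_hour <- table(delayed$hour)
print(delays_per_hour)
```

The same pattern applies to arrival delays by swapping in `arr_delay`.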


Figure 7: Departure and Arrival delays per month


We can see from the first plot that the highest numbers of departure delays occur in months 6 (Jun), 7 (Jul), and 12 (Dec), which indicates that there are more departure delays during the summer and winter breaks.


Similarly, in the second plot the highest numbers of arrival delays occur in months 7 (Jul) and 12 (Dec), during the summer and winter breaks.



Figure 8: Number of flights and their departure delays per month


In the first plot, we can see that as the number of flights increases, the number of departure delays also increases. The months with the highest delays, 6 (Jun), 7 (Jul), and 12 (Dec), also have the highest numbers of flights compared to other months.

Figure 9: Number of flights and their arrival delays per month


The second plot is similar to the first: as the number of flights increases, the number of arrival delays also increases. The exception is month 8 (Aug), where the number of flights was high but the arrival delays were lower than in months with fewer flights.


Figure 10: Percentage of flights’ departure and arrival delays

We can see that almost 39% of the NYC flights in the year 2013 had a departure delay, only 5% departed on time and 55.9% departed before time. As for the arrival delays, we can see that it doesn’t differ that much from the departure delays.
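The split into early, on-time and delayed flights can be reproduced by classifying each delay by its sign. The following is a minimal sketch on hypothetical values, assuming "on time" means a delay of exactly 0 minutes and that flights with a missing delay are excluded.

```r
# Toy departure delays in minutes (hypothetical values); NA = no record.
dep_delay <- c(2, 4, -1, -6, 0, NA, 15, -3, 0, -2)

# Drop missing values, then label each flight by the sign of its delay.
observed <- dep_delay[!is.na(dep_delay)]
status <- factor(
  sign(observed),
  levels = c(-1, 0, 1),
  labels = c("early", "on time", "delayed")
)

# Absolute counts and percentage shares per category.
counts <- table(status)
shares <- round(100 * prop.table(counts), 1)
print(shares)
```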


carrier no_flights low_delay medium_delay high_delay overall_delay
UA 57782 26% 14% 7% 47%
B6 54049 17% 14% 8% 40%
EV 51108 15% 16% 13% 45%
DL 47658 16% 10% 6% 32%
AA 31947 16% 9% 6% 32%
MQ 25037 11% 13% 8% 32%
US 19831 12% 8% 4% 24%
9E 17294 15% 14% 11% 40%
WN 12044 27% 17% 9% 54%
VX 5116 26% 10% 7% 43%
FL 3175 25% 16% 10% 52%
AS 709 18% 7% 6% 32%
F9 681 22% 16% 11% 50%
YV 544 14% 14% 14% 43%
HA 342 13% 4% 3% 20%
OO 29 10% 7% 14% 31%

Table 3: Carriers, their number of flights and the percentage of flights with delays

Carrier WN has the highest overall percentage of delayed flights (54%), while US has a low percentage of delays (24%) relative to its number of flights.



Figure 11: Carriers, their number of flights and the percentage of flights with delays

From the horizontal stacked bar chart it is clear that carriers UA, EV, B6 and DL have the highest frequency of delays. These carriers also have the highest numbers of flights, which suggests that carriers with a high number of flights tend to have a high frequency of delays.


carrier highest_delay avg_delay
F9 853 20.201175
EV 548 19.838929
YV 387 18.898897
FL 602 18.605984
WN 471 17.661657
9E 747 16.439574
B6 502 12.967548
VX 653 12.756646
OO 154 12.586207
UA 483 12.016908
MQ 1137 10.445381
DL 960 9.223950
AA 1014 8.569130
AS 225 5.830748
HA 1301 4.900585
US 500 3.744693

Figure 12: Carriers and their maximum and average delays

The carrier with the longest single delay is HA, with a delay of 1301 minutes, and the carrier with the highest average delay is F9, with an average of 20.2 minutes.
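A per-carrier table of maximum and average delays can be built with `tapply()`. The sketch below uses hypothetical toy values and assumes the table is based on `dep_delay`, which is consistent with the maximum of 1301 minutes matching the `dep_delay` maximum in the summary output earlier in the report.

```r
# Toy data (hypothetical values); NA marks a flight with no recorded delay.
toy <- data.frame(
  carrier   = c("F9", "F9", "HA", "HA", "US", "US"),
  dep_delay = c(40, 0, 100, -10, 5, NA)
)

# Longest and average departure delay per carrier, ignoring missing values.
highest <- tapply(toy$dep_delay, toy$carrier, max, na.rm = TRUE)
average <- tapply(toy$dep_delay, toy$carrier, mean, na.rm = TRUE)

per_carrier <- data.frame(
  carrier       = names(highest),
  highest_delay = as.vector(highest),
  avg_delay     = as.vector(average)
)

# Sort by average delay, largest first, as in the table above.
per_carrier <- per_carrier[order(-per_carrier$avg_delay), ]
print(per_carrier)
```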


Conclusion

After conducting the EDA, we found that delays on average do not exceed 15 minutes. The sources of the delays vary and are related to multiple factors, notably the date and time of the flight and the flight carrier.


Resources

Wickham H (2022). nycflights13: Flights that Departed NYC in 2013. R package version 1.0.2, https://github.com/hadley/nycflights13.


Source Code

This report is hosted on GitHub Pages and the repo can be accessed via this link.